
Introduce timeout for hanging workspaces #605

Merged (4 commits, Sep 30, 2021)

Conversation

@amisevsk (Collaborator) commented Sep 21, 2021:

Note: this PR depends on #598 and so for now includes those changes as well; ignore all but the last three commits (i.e. 05c0441 to the end). Once #598 is merged, this commit will be a lot simpler.

What does this PR do?

Introduces config setting .workspace.startProgressTimeout that controls how long a workspace can be in the starting phase without progressing before it is automatically failed.

"Progressing" in our case means "updating workspace .status.conditions" -- if the last condition change (checked by lastTransitionTime) is more than idleTimeout ago, we assume the workspace has failed.

Default timeout value is 5 minutes, to allow for e.g. an incredibly slow image pull.
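As a rough illustration of that check (a sketch only -- the Condition type and startTimedOut helper below are hypothetical names, not the operator's actual code), the newest lastTransitionTime across the status conditions is compared against the configured timeout:

package main

import (
    "fmt"
    "time"
)

// Condition mirrors the parts of a DevWorkspace status condition relevant here.
type Condition struct {
    Type               string
    LastTransitionTime time.Time
}

// startTimedOut reports whether no condition has changed for longer than
// timeout, i.e. the workspace is no longer "progressing".
func startTimedOut(conditions []Condition, timeout time.Duration, now time.Time) bool {
    var latest time.Time
    for _, c := range conditions {
        if c.LastTransitionTime.After(latest) {
            latest = c.LastTransitionTime
        }
    }
    return !latest.IsZero() && now.Sub(latest) > timeout
}

func main() {
    conditions := []Condition{
        {Type: "DeploymentReady", LastTransitionTime: time.Now().Add(-6 * time.Minute)},
    }
    // With the default 5m timeout, a workspace whose last condition change was
    // 6 minutes ago is considered hung.
    fmt.Println(startTimedOut(conditions, 5*time.Minute, time.Now())) // true
}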

What issues does this PR fix or reference?

Closes #578

Is it tested? How?

  1. Create or update the DevWorkspaceOperatorConfig to ignore FailedScheduling events (and optionally set a shorter startProgressTimeout):
    kubectl patch dwoc devworkspace-operator-config --type merge -p \
      '{
        "config": {
          "workspace": {
            "ignoredUnrecoverableEvents": [
              "FailedScheduling"
            ],
            "startProgressTimeout": "30s"
          }
        }
      }'
  2. Create a DevWorkspace that cannot be scheduled:
    cat <<EOF | kubectl apply -f -
    kind: DevWorkspace
    apiVersion: workspace.devfile.io/v1alpha2
    metadata:
      name: test-devworkspace
    spec:
      started: true
      template:
        components:
          - name: tooling
            container:
              image: quay.io/wto/web-terminal-tooling
              args: ["tail", "-f", "/dev/null"]
              memoryRequest: 100Gi
              memoryLimit: 100Gi
    EOF
  3. Wait 30 seconds; workspace should fail to start.

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

Add setting workspace.startProgressTimeout to denote maximum duration
for any workspace phase before the workspace start is failed. Default
value is 5m

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
// a "Starting" phase without progressing before it is automatically failed.
// Duration should be specified in a format parseable by Go's time package, e.g.
// "15m", "20s", "1h30m", etc. If not specified, the default value of "5m" is used.
StartProgressTimeout string `json:"startProgressTimeout,omitempty"`
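Since the field is a plain string, the value has to be parsed with Go's time package, falling back to the 5-minute default when unset. A minimal sketch of that parsing (the helper name is hypothetical, not necessarily how the operator implements it):

package main

import (
    "fmt"
    "time"
)

const defaultStartProgressTimeout = 5 * time.Minute

// parseStartProgressTimeout converts the config string into a duration,
// returning the default for an empty value and an error for a malformed one.
func parseStartProgressTimeout(raw string) (time.Duration, error) {
    if raw == "" {
        return defaultStartProgressTimeout, nil
    }
    timeout, err := time.ParseDuration(raw)
    if err != nil {
        return defaultStartProgressTimeout, fmt.Errorf("invalid startProgressTimeout %q: %w", raw, err)
    }
    return timeout, nil
}

func main() {
    timeout, err := parseStartProgressTimeout("1h30m")
    fmt.Println(timeout, err) // 1h30m0s <nil>
}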
Contributor:

I wonder if this might make more sense to be something like StartPhaseTimeout? WDYT?

Collaborator (Author):

I renamed it to ProgressTimeout to include failing DevWorkspaces and forgot this comment -- sorry. Are you okay with that name? My thinking is that it's the maximum duration a DevWorkspace can exist without progressing.

@JPinkney (Contributor) left a comment:

Just tested and everything worked as expected

@sleshchenko (Member) left a comment:

LGTM. I just have some questions, which I may answer myself during testing, but I'm not there yet.

controllers/workspace/status.go (review thread; outdated, resolved)
@@ -355,6 +376,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
}
reqLogger.Info("Waiting on deployment to be ready")
reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for workspace deployment")
if !deploymentStatus.Requeue && deploymentStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
Member:

Does the last or the first interval win here?

  • I mean, we wait for the deployment initially and schedule reconciling in 5 secs (1);
  • The deployment creates pods, which invokes the reconcile loop, and we schedule again in 5 secs (2);
  • The pod is updated several times as container statuses propagate, so we schedule N more times in 5 secs (N).

The question is: how many reconcile loops will our RequeueAfter initiate?

Collaborator (Author):

My understanding is that reconciling due to an event cancels out a requeueAfter, but it's hard to check for sure. In my testing, I don't see bursts of "waiting on deployment" in the logs, which is what would likely happen if we were stacking requeues.

@@ -59,6 +59,10 @@ import (
dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
)

const (
startingWorkspaceRequeueInterval = 5 * time.Second
Member:

I wonder if 5 secs is always the best choice, or whether we could make it dynamic, e.g. 1, 2, or 3 minutes in the case of a 5-minute timeout?

Collaborator (Author):

I mostly picked 5 seconds because it feels long enough that it's not burdening the controller (we can reconcile many times a second) but also short enough to avoid strange issues in setting a timeout (e.g. if we set a 1 minute requeue, what happens if the config specifies a timeout of 1 minute 15 seconds?)

Member:

> 1 minute 15 seconds

I don't think there is a use case for such a precise timeout. Maybe we can declare our precision to be something like 1 minute or 30 seconds.

And actually, I thought about reconciling after (timeout + last transition time - now + 5 sec), which would initiate a reconcile loop about 5 seconds after the point when we would potentially need to fail the workspace.

But I'm OK with any approach, as long as it does not generate redundant load.
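A minimal sketch of the dynamic requeue delay described above, i.e. (timeout + last transition time - now + 5 sec); the helper and values are illustrative, not code from this PR:

package main

import (
    "fmt"
    "time"
)

// requeueDelay schedules the next reconcile roughly 5 seconds after the
// progress timeout would expire, instead of polling every 5 seconds.
func requeueDelay(lastTransition time.Time, timeout time.Duration, now time.Time) time.Duration {
    delay := lastTransition.Add(timeout).Sub(now) + 5*time.Second
    if delay < 5*time.Second {
        // The timeout has already expired (or nearly so); reconcile soon.
        delay = 5 * time.Second
    }
    return delay
}

func main() {
    now := time.Now()
    lastTransition := now.Add(-2 * time.Minute) // example: last progress was 2 minutes ago
    fmt.Println(requeueDelay(lastTransition, 5*time.Minute, now)) // ~3m5s
}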

@amisevsk (Collaborator, Author) commented Sep 30, 2021:

I tested this informally by starting a workspace that will time out (ignoring FailedScheduling events and asking for 100Gi in Theia). While the workspace is looping on checking the deployment every 5 seconds, I ran

for i in {1..5}; do
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "false"}}}'
  sleep 0.5
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "true"}}}'
  sleep 0.5
done

to trigger 10 reconciles to the object, with the assumption that each of those would also start a RequeueAfter. However, after the 10 quick reconciles are completed, the controller goes back to queuing reconciles every five seconds, rather than 10 reconciles every 5 seconds, so it seems like RequeueAfter is cancelled if an event triggers a reconcile.

I agree that 5 seconds may be the wrong value here; we should tweak this later.

Rename startProgressTimeout to progressTimeout and repurpose it to
detect workspaces stuck in the "Failing" state for too long as well.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Sep 27, 2021
@sleshchenko (Member) left a comment:

I don't have any reason to delay delivering this valuable feature.
Approving; if I find any way to improve or optimize the implementation (if that's really needed), I'll raise a dedicated issue.

openshift-ci bot commented Sep 28, 2021:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, JPinkney, sleshchenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JPinkney,amisevsk,sleshchenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sleshchenko (Member) commented:

/test v8-devworkspace-operator-e2e, v8-che-happy-path

openshift-ci bot commented Sep 28, 2021:

@amisevsk: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/v8-che-happy-path
Commit: 9479380
Required: true
Rerun command: /test v8-che-happy-path


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sleshchenko (Member) commented:

Che Happy path fails with an error (screenshot attached: Screenshot_20210928_180850).

I assume it's caused by a health check that treats a redirect as success.
So the health endpoint should be added to the allowlist on the Che side; it should not be a blocker for this PR.
But we could also make DWO accept only 2XX response codes and not follow redirect links -- that's another story, though.
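A minimal sketch of the stricter health check mentioned above -- accept only 2XX responses and never follow redirects, so a 302 to a login page counts as unhealthy. This illustrates the idea only; it is not DWO's actual health-check code, and the endpoint URL is hypothetical:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// checkHealthy returns true only for 2XX responses; redirects are not followed.
func checkHealthy(url string) (bool, error) {
    client := &http.Client{
        Timeout: 5 * time.Second,
        // Stop at the first redirect response instead of following it.
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
    resp, err := client.Get(url)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

func main() {
    healthy, err := checkHealthy("http://localhost:8080/healthz") // hypothetical endpoint
    fmt.Println(healthy, err)
}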

@Divine1 commented Jul 8, 2022:

Hi team,

I have installed Eclipse Che in a k8s cluster and I'm facing the timeout error below.
Please let me know how I can increase the timeout limit here.
Should I add StartProgressTimeout in devfile.yaml? Please help me with this.

[screenshot: workspace start timeout error]

I tried increasing the timeout in the CheCluster, but it had no effect.

apiVersion: org.eclipse.che/v2
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  components:
    cheServer:
      extraProperties:
        CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN: "15"
    database:
      externalDb: true
      postgresHostName: sc2-10-186-67-195.eng.vmware.com
      postgresPort: "5432"
  networking:
    auth:
      identityProviderURL: https://dexvmware.com
      oAuthClientName: eclipse-che
      oAuthSecret: sdsdsdsdsd

@sleshchenko @amisevsk @dmytro-ndp

@amisevsk (Collaborator, Author) commented Jul 8, 2022:

Hi @Divine1 -- that error will occur when the DevWorkspace does not reach a "Running" state before the start timeout (introduced in this PR). By default this is 5 minutes but can be increased by setting the .config.workspace.progressTimeout field in the DevWorkspaceOperatorConfig, e.g.:

apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: <devworkspace/che installation namespace>
config:
  workspace:
    progressTimeout: 15m # This sets the timeout to 15 minutes

However, if it's taking longer than 5 minutes for your workspace to start, that suggests a different problem. Do you have any information about the workspace or its pods while it's starting? Some things to check:

  • What phase does the workspace get stuck in (kubectl get dw <workspace-name>)?
  • Is the workspace deployment running/ready? If so, is there anything in its logs?

From here, we can start to narrow down what's causing the issue.

Feel free to open another issue in this repo and we can discuss there.

@tolusha (Contributor) commented Aug 18, 2022:

@amisevsk @Divine1
So, the problem is the following: it takes more than 5m to pull images, which causes the workspace to fail to start.
What are the possible solutions?

@Divine1 commented Aug 18, 2022:

@tolusha Thank you for checking this.

I have posted logs at the link below, which show the workspace failing to start within 5m. Also, is there any issue with the project-clone init container?

Detailed logs are in eclipse-che/che#21613 (comment)

@amisevsk

@Divine1 commented Aug 18, 2022:

@tolusha I'm able to see errors in the kubectl get events -n workspace-namespace output. If increasing the timeout is the solution, then there shouldn't be any other errors in the kubectl get events -n workspace-namespace output, right?

One more doubt: I noticed that the workspace deployment is applied immediately after the PVC is created, but provisioning a PV takes more time. Should there be any delay between applying the workspace deployment and PV provisioning?

@amisevsk (Collaborator, Author) commented:

If the issue is that pulling images takes longer than 5 minutes, then the solutions would be either

  1. Increase the timeout as specified in Introduce timeout for hanging workspaces #605 (comment) (I don't believe the dashboard is failing workspaces, and only the duration in the error is hard-coded)
  2. Install and configure the Kubernetes Image Puller in the cluster to automatically pull images before workspaces are started.
  3. Manually ensure every node in the cluster has the images required (the approach the image puller uses is to create a daemonset to run a pod with those images on every node)


Successfully merging this pull request may close these issues.

Introduce timing out for DevWorkspace to prevent hanging on any of startup phase
5 participants